In [ ]:

    
%%html
<link rel='stylesheet' type='text/css' href='custom.css'/>



In [ ]:

    
!rm data/converted-seqs.fasta data/converted-seqs.qual data/not-yasf.fna



In [ ]:

    
%matplotlib inline
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.figure()
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    plt.ylabel('Known taxonomy')
    plt.xlabel('Predicted taxonomy')
    plt.tight_layout()
    plt.show()

A Bioinformatics Library for Data Scientists, Students, and Developers

Jai Rideout and Evan Bolyen

Caporaso Lab, Northern Arizona University

What is scikit-bio?

A Python bioinformatics library for:

data scientists

students

developers

"The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats." - Law's First Law

Axt BAM SAM BED bedGraph bigBed bigGenePred table bigWig Chain GenePred table GFF GTF HAL MAF Microarray Net Personal Genome SNP format PSL VCF WIG abi ace clustal embl fasta fastq genbank ig imgt nexus phred phylip pir seqxml sff stockholm swiss tab qual uniprot-xml emboss PhyolXML NexML newick CDAO MDL bcf caf gcproj scf SBML lsmat ordination qseq BIOM ASN.1 .2bit .nib ENCODE ...

I/O in bioinformatics is hard

format redundancy (many-to-many)

format ambiguity

heterogeneous sources

How can we solve this?

An I/O Registry!

Format redundancy (many-to-many)



In [ ]:

    
from skbio import DNA

seq1 = DNA.read('data/seqs.fasta', qual='data/seqs.qual')
seq2 = DNA.read('data/seqs.fastq', variant='illumina1.8')
seq1



In [ ]:

    
seq1 == seq2

Format ambiguity



In [ ]:

    
import skbio.io

skbio.io.sniff('data/mystery_file.gz')

Heterogeneous sources

Read a gzip file from a URL:



In [ ]:

    
from skbio import TreeNode

tree1 = skbio.io.read('http://localhost:8888/files/data/newick.gz', 
                      into=TreeNode)
print(tree1.ascii_art())

Read a bz2 file from a file path:



In [ ]:

    
import io 

with io.open('data/newick.bz2', mode='rb') as open_filehandle:
    tree2 = skbio.io.read(open_filehandle, into=TreeNode)

print(tree2.ascii_art())

Read a list of lines:



In [ ]:

    
tree3 = skbio.io.read(['((a, b, c), d:15):0;'], into=TreeNode)
print(tree3.ascii_art())

Let's make a format!

YASF (Yet Another Sequence Format)



In [ ]:

    
!cat data/yasf-seq.yml



In [ ]:

    
import yaml

yasf = skbio.io.create_format('yasf')

@yasf.sniffer()
def yasf_sniffer(fh):
    return fh.readline().rstrip() == "#YASF", {}

@yasf.reader(DNA)
def yasf_to_dna(fh):
    seq = yaml.load(fh.read())
    return DNA(seq['Sequence'], metadata={
        'id': seq['ID'],
        'location': seq['Location'],
        'description': seq['Description']
    })



In [ ]:

    
seq = DNA.read("data/yasf-seq.yml")
seq

Convert YASF to FASTA



In [ ]:

    
seq.write("data/not-yasf.fna", format='fasta')
!cat data/not-yasf.fna

We are in beta - should you even use our software?

YES!

API Lifecycle



In [ ]:

    
from skbio.util._decorator import stable

@stable(as_of='0.4.0')
def add(a, b):
    """add two numbers.
    
    Parameters
    ----------
    a, b : int
        Numbers to add.
        
    Returns
    -------
    int
        Sum of `a` and `b`.
    
    """
    return a + b



In [ ]:

    
help(add)

What is stable:

skbio.io
skbio.sequence

What is next:

skbio.alignment
skbio.tree
skbio.diversity
skbio.stats
<your awesome subpackage!>

Sequence API: putting the scikit in scikit-bio



In [ ]:

    
seq = DNA("AacgtGTggA", lowercase='exon')
seq

Made with numpy



In [ ]:

    
seq.values

And a pinch of pandas



In [ ]:

    
seq.positional_metadata

Slicing with positional metadata:



In [ ]:

    
seq[seq.positional_metadata['exon']]

Application: building a taxonomy classifier



In [ ]:

    
aligned_seqs_fp = 'data/gg_13_8_otus/rep_set_aligned/82_otus.fasta'
taxonomy_fp = 'data/gg_13_8_otus/taxonomy/82_otu_taxonomy.txt'



In [ ]:

    
from skbio import DNA

fwd_primer = DNA("GTGCCAGCMGCCGCGGTAA",
                 metadata={'label':'fwd-primer'})
rev_primer = DNA("GGACTACHVGGGTWTCTAAT",
                 metadata={'label':'rev-primer'}).reverse_complement()



In [ ]:

    
def seq_to_regex(seq):
    result = []
    for base in str(seq):
        if base in DNA.degenerate_chars:
            result.append('[{0}]'.format(
                ''.join(DNA.degenerate_map[base])))
        else:
            result.append(base)

    return ''.join(result)

regex = '({0}.*{1})'.format(seq_to_regex(fwd_primer),
                            seq_to_regex(rev_primer))



In [ ]:

    
import numpy as np
import skbio

starts = []
stops = []
for seq in skbio.io.read(aligned_seqs_fp, format='fasta', 
                         constructor=DNA):
    for match in seq.find_with_regex(regex, ignore=seq.gaps()):
        starts.append(match.start)
        stops.append(match.stop)
        
locus = slice(int(np.median(starts)), int(np.median(stops)))
locus



In [ ]:

    
kmer_counts = []
seq_ids = []
for seq in skbio.io.read(aligned_seqs_fp, format='fasta',
                         constructor=DNA):
    seq_ids.append(seq.metadata['id'])
    sliced_seq = seq[locus].degap()
    kmer_counts.append(sliced_seq.kmer_frequencies(8))



In [ ]:

    
from sklearn.feature_extraction import DictVectorizer
X = DictVectorizer().fit_transform(kmer_counts)



In [ ]:

    
taxonomy_level = 3 # class
id_to_taxon = {}
with open(taxonomy_fp) as f:
    for line in f:
       id_, taxon = line.strip().split('\t')
       id_to_taxon[id_] = '; '.join(taxon.split('; ')[:taxonomy_level])

y = [id_to_taxon[seq_id] for seq_id in seq_ids]



In [ ]:

    
from sklearn.feature_selection import SelectPercentile

X = SelectPercentile().fit_transform(X, y)



In [ ]:

    
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=0)



In [ ]:

    
from sklearn.svm import SVC

y_pred = SVC(C=10, kernel='linear', degree=3,
             gamma=0.001).fit(X_train, y_train).predict(X_test)



In [ ]:

    
from sklearn.metrics import confusion_matrix, f1_score

cm = confusion_matrix(y_test, y_pred)
cm_normalized = cm / cm.sum(axis=1)[:, np.newaxis]
plot_confusion_matrix(cm_normalized, title='Normalized confusion matrix')

print("F-score: %1.3f" % f1_score(y_test, y_pred, average='micro'))

Acknowledgements

scikit-bio development team

Funding

Alfred P Sloan Foundation
National Science Foundation
National Institutes of Health
Arizona Board of Regents Technology and Research Investment Fund

The Caporaso Lab is hiring postdocs and developers, find us if you want to get paid to work on scikit-bio!

We're having a sprint on Saturday and Sunday!